Business Analytics

Advanced Data Visualizations

Ayush Patel and Jayati Sharma

13 March, 2024

Pre-requisite

You already:

  • Know basic and advanced data wrangling functions in R
  • Know basics of data visualization in R
  • Can write functions in R

Before we begin

Please install and load the following packages

library(dplyr)
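library(tidyverse)
library(lubridate) # make_date() and ymd(); attached automatically only by tidyverse 2.0.0 or later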
library(scales)
library(patchwork)
library(ggiraph)
library(gghighlight)
library(ISLR2)
library(openintro)



Access the lecture slides from the course landing page

About me

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objectives

  • Learn annotation for graphs in R
  • Learn how to combine graphs
  • Learn scaling functions in R
  • Learn how to make ggplot graphs interactive

Let’s Recap

  • In the data visualization lecture, you learnt how to create various types of graphs using ggplot2
  • Some of them include bar graphs, line graphs, scatter plots, etc.
  • For effective data visualization and communication, any plot requires modifications
  • These include annotations on the plot, modification of axes and scales, highlighting and interactivity of the plot
  • The aim of this lecture is to move beyond making graphs, towards clear and effective visualizations

Annotations in ggplot - Text

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • In addition to plotting your graph, you want to provide additional details to explain your graph
  • Text annotations are useful in this case
  • The annotate() function can be used for any kind of geometric object
  • In the annotate() function, the type of geom is specified first
  • Then, the positioning is required (x and y coordinates in this case)
  • This is followed by the label
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("text", x = 4, y = 25, label = "Annotation Text")

Annotations in ggplot

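  • Further, annotations can be customized, for example with colour and size
  • Passing a vector of positions, such as x = 1:5, repeats the annotation at each position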
ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("text", x = 4, y = 25, label = "Annotation Text", colour = "orange", size = 8)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("text", x = 1:5, y = 6, label = "Annotation Text", colour = "orange", size = 3)

Annotations in ggplot

  • Similar to text annotation, other geoms can be used for annotations, such as a rectangle
  • For a rectangle, the corners are given by xmin, xmax, ymin and ymax instead of x and y
  • Do you remember what alpha is used for?
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("rect", xmin = 4.8, 
            xmax = 5.7,
            ymin = 10,
            ymax = 18.6, 
            alpha = .2)
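
Annotations can also be layered on top of each other. The sketch below shades a region and then labels it with a text annotation. The fill colour, the label wording and the label position are illustrative choices, not part of the original example. Because alpha controls transparency, the points under the rectangle stay visible.

```r
library(ggplot2)

# Shade a region of interest and label it.
# Fill colour, label text and label position are illustrative assumptions.
p_rect <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  # alpha = 0.2 makes the rectangle mostly transparent
  annotate("rect", xmin = 4.8, xmax = 5.7, ymin = 10, ymax = 18.6,
           alpha = 0.2, fill = "blue") +
  # place the label just above the rectangle
  annotate("text", x = 5.25, y = 20, label = "Heavy cars", size = 4)

p_rect
```

Each annotate() call adds exactly one object to the plot, regardless of how many rows the data has.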

Annotations in ggplot

  • Suppose you want to add a line segment to your graph
  • Here, annotate() requires the start (x, y) and end (xend, yend) coordinates of the segment
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
   annotate("segment", x = 4.8,
            xend = 5.7,
            y = 10,
            yend = 18.6,
            colour = "red")
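
A segment annotation can be turned into an arrow that points at something on the plot. This is a minimal sketch: the coordinates and label text are made up for illustration, and the arrow head is built with arrow() and unit() from the grid package, which ships with R.

```r
library(ggplot2)

# Draw an arrow pointing at the heaviest cars and label its starting point.
# Coordinates and label text are illustrative assumptions.
p_arrow <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  annotate("segment", x = 4.5, xend = 5.3, y = 25, yend = 11.5,
           colour = "red",
           # arrow head drawn at the (xend, yend) end of the segment
           arrow = grid::arrow(length = grid::unit(0.3, "cm"))) +
  annotate("text", x = 4.5, y = 26, label = "Heaviest cars")

p_arrow
```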

Do it Yourself - 1

  • Load the Auto data from ISLR2 package
  • Make a scatterplot of horsepower and acceleration
  • To the plot, add the text Horsepower vs. Acceleration
  • Add a rectangle to the plot, such that it covers the area where horsepower is higher than 200 but acceleration is still less than 15
  • Add a line to the plot from the coordinates (50,10) to (150,20)

Scales Functions in ggplot2 - Why?

  • When you create a graph using ggplot2, the axes are scaled automatically based on the data
  • However, you will often need to change the axes to present the data effectively
  • The scale functions in ggplot2:
    • control how the data is plotted
    • allow manipulation of axes
    • improve the overall appearance of the plot for effective data communication

Scales Functions in ggplot2

  • Look at the scatter plot of wt and mpg
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()

  • What if you want an axis, say the y-axis, to start from 0?
  • scale_y_continuous() allows you to set the range for the y-axis
  • The limits argument of scale_y_continuous() takes the lower and upper limits of the scale
  • Here, NA keeps the upper limit that ggplot2 computes from the data
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, NA))
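
The same idea extends to both axes. A minimal sketch, assuming you want the x-axis to start from 0 as well, uses scale_x_continuous() alongside scale_y_continuous():

```r
library(ggplot2)

# Start both axes at 0 and let ggplot2 pick the upper limits (NA).
p_zero <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_x_continuous(limits = c(0, NA)) +
  scale_y_continuous(limits = c(0, NA))

p_zero
```

If all you need is for 0 to be included, expand_limits(x = 0, y = 0) is a shorter alternative.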

Scales Functions in ggplot2

  • Instead of using NA, you can provide an explicit upper limit for the y-axis, such as 40
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(limits = c(0, 40))
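
One caveat is worth knowing: limits set on a scale drop the observations that fall outside them (ggplot2 warns about the removed rows when the plot is drawn), whereas coord_cartesian() only zooms the view and keeps all the data. The narrow range of 15 to 25 below is an illustrative choice that makes the difference visible.

```r
library(ggplot2)

# Scale limits: points with mpg outside 15-25 are removed before drawing.
p_scale <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  scale_y_continuous(limits = c(15, 25))

# coord_cartesian(): same visible range, but no data is removed.
p_zoom <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  coord_cartesian(ylim = c(15, 25))

p_scale
p_zoom
```

The distinction matters most for layers that compute summaries, such as smoothers or boxplots, because those are calculated only on the data that remains.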

Scales Functions in ggplot2

  • Setting breaks in scale_y_continuous() controls where the tick marks and grid lines are placed on the axis; here they are placed at intervals of 7
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_continuous(breaks = seq(0, 40, 7))

Scales Functions in ggplot2

  • Similarly, there are other transformations for scales; scale_y_reverse() reverses the direction of the y-axis
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_reverse()

  • scale_y_log10() applies a base-10 log transformation to the y-axis
ggplot(mtcars, aes(x = wt, y = mpg)) +
   geom_point()+
  scale_y_log10()

Scales Functions in ggplot2

Content for this topic has been sourced from ggplot2. Please check out the work for detailed information.

  • The scale_colour_brewer() function is useful for colouring discrete values on your graph
  • The brewer scales provide sequential, diverging and qualitative colour schemes from ColorBrewer; the default is a sequential scheme
  • Compare the two charts below
  • scale_colour_brewer() helps in efficient mapping of discrete variables to colours
ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))

ggplot(mpg, aes(x = displ, y = cty)) +
   geom_point(aes(colour = class))+
  scale_colour_brewer()
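
Vehicle class has no natural order, so a qualitative palette may suit it better than the default sequential one. The sketch below picks the ColorBrewer palette "Set2"; the choice of palette is an assumption made for illustration, and any qualitative palette with at least seven colours would do.

```r
library(ggplot2)

# Use a qualitative ColorBrewer palette for an unordered discrete variable.
# "Set2" is an illustrative choice; it offers up to 8 distinct colours.
p_qual <- ggplot(mpg, aes(x = displ, y = cty)) +
  geom_point(aes(colour = class)) +
  scale_colour_brewer(palette = "Set2")

p_qual
```

Running RColorBrewer::display.brewer.all() shows every available palette, which helps when choosing one.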

Do it Yourself - 2

  • Using the Auto data, make a scatterplot with weight on the y-axis and displacement on the x-axis
  • Set the breaks on the x-axis at intervals of 50
  • Set the breaks for y-axis as 5
  • What variable according to you can be used as the colour for the points? How?

Scales Package

  • The scales package provides many scaling functions for visualizations
  • It allows for sophisticated customisation of how data appears in a plot
  • Its break and label functions make axes readable and informative
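
To get a feel for these functions, it helps to know that each label_*() function returns another function, which turns numbers into formatted strings. A minimal sketch with made-up input values:

```r
library(scales)

# Each label_*() call builds a formatter; calling the formatter on a
# numeric vector returns character labels. Input values are made up.
dollars  <- label_dollar()(c(1000, 2500))   # "$1,000" "$2,500"
percents <- label_percent()(c(0.25, 0.5))   # "25%" "50%"
commas   <- label_comma()(1234567)          # "1,234,567"

print(dollars)
print(percents)
print(commas)
```

This is why such a function can be passed directly to the labels argument of a ggplot2 scale: ggplot2 calls it on the break values to produce the axis labels.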

Scales

Content for this topic has been sourced from scales. Please check out the work for detailed information.

  • Look at the following chart made using txhousing data
  • We want to make it more readable and clear
  • The y-axis labels can be shortened by cutting the zeroes, and the years on the x-axis can be shown in a more compact way
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE)

Scales

  • Similar to the scale functions in ggplot2, the scales package has functions for breaks and labels
  • The breaks_width() function places a break every two years on the x-axis, while label_date() with the %y format shows only the last two digits of the year, making the axis clearer
  • On the y-axis, label_number() with cut_short_scale() drops the extra zeroes and adds a K suffix for thousands
  • Note that scale_y_log10() also puts the y-axis on a log scale
txhousing %>% 
  mutate(date = make_date(year, month, 1)) %>% 
  group_by(city) %>% 
  filter(min(sales) > 500) %>% 
  ggplot(aes(date, sales, group = city)) + 
  geom_line(na.rm = TRUE) + 
  scale_x_date(
    NULL,
    breaks = scales::breaks_width("2 years"), 
    labels = scales::label_date("'%y")) + 
  scale_y_log10(
    "Total sales",
    labels = scales::label_number(scale_cut = scales::cut_short_scale()))
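
The format string passed to label_date() follows the usual R date format codes. A minimal sketch, using two made-up dates, shows how "'%y" compares with the four-digit "%Y":

```r
library(scales)

# Two illustrative dates, two years apart.
dates <- as.Date(c("2000-01-01", "2002-01-01"))

short_years <- label_date("'%y")(dates)  # "'00" "'02"
full_years  <- label_date("%Y")(dates)   # "2000" "2002"

print(short_years)
print(full_years)
```

The apostrophe in "'%y" is plain text that is copied into the label; only %y is replaced by the two-digit year.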

Scales

  • Let us try modifying another graph using economics data
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

  • How can this be made more readable?
  • For the x-axis, one option is to show months along with the years, for better insights
  • For the y-axis, a label that adds the dollar sign would make the chart more readable

Scales

  • breaks_width("3 months") places a break every 3 months
  • With breaks this frequent, you would want the labels to show months rather than full dates
  • label_date_short() shortens the labels by showing only the parts of the date that are needed to tell the breaks apart
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

Scales

  • For the y-axis, breaks_extended() lets you request roughly how many breaks you want (8 here)
  • label_dollar() adds a dollar sign to the y-axis
economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())
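
In the economics data that ships with ggplot2, pce is recorded in billions of dollars, so a label such as $500 understates the amount. One possible refinement, sketched below, is to add a suffix through the suffix argument of label_dollar(). The suffix "B" is an illustrative choice.

```r
library(ggplot2)
library(dplyr)
library(lubridate)

# Append "B" (billions) to the dollar labels on the y-axis.
# The suffix text is an illustrative assumption.
p_pce <- economics %>%
  filter(date < ymd("1970-01-01")) %>%
  ggplot(aes(date, pce)) +
  geom_line() +
  scale_y_continuous("Personal consumption expenditures",
    labels = scales::label_dollar(suffix = "B"))

p_pce
```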

Do it Yourself - 3

  • Load the tourism data from openintro
  • Make a line graph of year and tourist spending
  • Is there any change you could make to the chart for better readability?

Patchwork

  • You have made multiple plots by now and want to combine them into the same graphic
  • A very easy way to do this is by using patchwork
  • Let us learn this using our recently made plots
p1 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line()

p2 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short())

p3 <- economics %>% 
  filter(date < ymd("1970-01-01")) %>% 
  ggplot(aes(date, pce)) + 
  geom_line() + 
  scale_x_date(NULL,
    breaks = scales::breaks_width("3 months"), 
    labels = scales::label_date_short()) + 
  scale_y_continuous("Personal consumption expenditures",
    breaks = scales::breaks_extended(8),
    labels = scales::label_dollar())

Patchwork

  • The usage of patchwork is very simple: you literally just add plots together!
p1 + p2

  • You can also put the plots one below the other
p1 / p2
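
For finer control over the arrangement, patchwork offers plot_layout(). The sketch below uses three stand-in plots built from mtcars (pa, pb and pc are hypothetical names, not the p1 to p3 defined earlier) so that it runs on its own.

```r
library(ggplot2)
library(patchwork)

# Three stand-in plots; names and variables are illustrative.
pa <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
pb <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
pc <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

# Arrange in two columns, with the first column twice as wide as the second.
combined <- pa + pb + pc + plot_layout(ncol = 2, widths = c(2, 1))

combined
```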

Patchwork

  • While plots p1 and p2 show the intermediate steps, p3 is the final plot
  • It would be better to have the two at the top and the final one at the bottom
(p1 + p2) / p3

Patchwork

  • After combining the plots, you may want to modify all of them at once; the & operator applies a change, such as a theme, to every plot in the patchwork
patchwork <- (p1 + p2) / p3
patchwork & theme_minimal()
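
The difference between & and + is easy to miss: & applies to every plot in the patchwork, while + applies only to the last plot added. A combined graphic can also be given an overall title and panel tags with plot_annotation(). The sketch below uses stand-in plots from mtcars; the plot names and title text are illustrative.

```r
library(ggplot2)
library(patchwork)

# Three stand-in plots; names and variables are illustrative.
pa <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
pb <- ggplot(mtcars, aes(hp, mpg)) + geom_point()
pc <- ggplot(mtcars, aes(factor(cyl))) + geom_bar()

layout <- (pa + pb) / pc

themed_all  <- layout & theme_minimal()  # theme applied to all three plots
themed_last <- layout + theme_minimal()  # theme applied to the last plot only

# Add an overall title and tag the panels A, B, C.
annotated <- themed_all +
  plot_annotation(title = "Three views of mtcars", tag_levels = "A")

annotated
```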

Do it Yourself - 4

  • From the tourism data, make line charts of year and visitor_count_tho, one for each decade
  • Combine these charts in such a way that at the top, the graph for all years is displayed and below it, there are 5 charts, one for each decade

Highlight information - gghighlight()

Content for this topic has been sourced from Hiroaki Yutani’s work. Please check out the work for detailed information.

  • Run the following code to generate a dataset randomly
set.seed(2)
data <- purrr::map_dfr(letters, ~ data.frame(
      id = 1:500,
      value = cumsum(runif(500, -5, 5)),
      type = .,
      flag = sample(c(TRUE, FALSE), size = 500, replace = TRUE),
      stringsAsFactors = FALSE))
  • Suppose you want to plot value against id, with one line for each type
ggplot(data) +
  geom_line(aes(x= id, y = value, colour = type))